An Improved Logging and Checkpointing Scheme for Recoverable Distributed Shared Memory

نویسندگان

  • Taesoon Park
  • Sung Bok Cho
  • Heon Young Yeom
چکیده

The distributed shared memory(DSM) system transforms an existing network of workstations to a powerful shared-memory parallel computer which could deliver superior price/performance. However, with more workstations engaged in the system and longer execution time, the probability of faults increases which could render the system useless. Several checkpointing and logging schemes have been proposed to enable the DSM system to continue work after transient failures. Using checkpoints, it is not necessary to roll back to the beginning of the process but the processes need to roll back to the latest checkpoint. The logging is introduced to further reduce the amount of rollback propagation on other related processes. Although logging makes the rollback propogation unnecessary , it introduces the overhead for the logging itself. If it is needed to log all the read/write operations, the logging overhead would be prohibitive. Moreover, some of the logging methods proposed earlier could result in incorrect recovery when processes synchronize using barriers. In this paper, we propose a novel logging scheme which greatly reduces the amount of logging by not loging all the pages accessed but logging only the pages which are invalidated. The performance our proposed scheme is analyzed using extensive simulation. Compared with two other schemes proposed earlier, our new logging scheme shows superior performance in various cases.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An efficient causal logging scheme for recoverable distributed shared memory systems

This paper presents a causal logging scheme for the lazy release consistent distributed shared memory systems. Causal logging is a very attractive approach to provide the fault tolerance for the distributed systems, since it eliminates the need of stable logging. However, since inter-process dependency must causally be transferred with the normal messages, the excessive message overhead has bee...

متن کامل

Using Logging and Asynchronous Checkpointing to Implement Recoverable Distributed Shared Memory

Distributed shared memory provides a useful paradigm for developing distributed applications. As the number of processors in the system and running time of distributed applications increase, the likelihood of processor failure increases. A method of recovering processes running in a distributed shared memory environment which minimizes lost work and the cost of recovery is desirable so that lon...

متن کامل

Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory

Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of high communication frequency. In this paper we show that, if designed correctly, a DSM system only needs to consider dependencies due to the transfe...

متن کامل

A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability

Large-scale distributed systems are very attractive for the execution of parallel applications requiring a huge computing power. However, their high probability of site failure is unacceptable, especially for long time running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM). Although most recover...

متن کامل

An E cient Logging and Recovery Scheme for LazyRelease Consistent Distributed Shared

Checkpointing and logging are widely used techniques to provide fault tolerance for the distributed systems. However, logging imposes too much overhead on the processing to be a practi-val solution. In this paper, we propose a low-overhead logging scheme for the distributed shared memory system based on the lazy release consistency model. Unlike the previous schemes in which the logging is perf...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1996